Abstract.

Background: In bioinformatics, tools like multiple sequence alignment and 
entropy methods probe sequence information and evolutionary relationships between 
species. Although powerful, they might miss crucial hierarchical relationships formed 
by the reuse of repetitive subsequences like duplicons and transposable elements. Such 
relationships are governed by “evolutionary tinkering”, as described by François Jacob.
The newly developed Ladderpath theory provides a quantitative framework to describe 
these hierarchical relationships. 

Results: Based on this theory, we introduce two indicators: the order-rate η,
characterizing sequence pattern repetitions and regularities, and the
ladderpath-complexity κ, characterizing hierarchical richness within sequences while
accounting for sequence length. Statistical analyses on real amino acid sequences showed: (1) Among
the typical species analyzed, humans possess relatively more sequences with large κ
values. (2) Proteins with a significant proportion of intrinsically disordered regions
exhibit increased η values. (3) There are almost no super long sequences with low
η. We hypothesize that this arises from varied duplication and mutation frequencies
across different evolutionary stages, which in turn suggests a zigzag pattern for the
evolution of protein complexity. This is supported by our simulations and examples
from protein families such as Ubiquitin and NBPF.

Conclusions: Our method emphasizes “how objects are generated”, capturing the 
essence of evolutionary tinkering and reuse. The findings hint at a connection between 
sequence orderliness and structural uncertainty, and suggest that different species 
or those in varied environments might adopt distinct protein elongation strategies. 
These insights highlight our method’s value for further in-depth evolutionary biology 
applications. 


Keywords: Evolutionary tinkering, Ladderpath theory, Reuse and modularity, Algorithmic
complexity, Shannon entropy, k-mer, Hierarchical structure, Organismal complexity, Intrinsically
disordered proteins, Protein elongation.


1. Introduction 


Bioinformatics approaches based on sequencing data have effectively demonstrated that 
DNA and amino acid sequences are encodable. This encodability has been illuminated 
by employing a range of potent mathematical and statistical techniques, revealing their 
biological significance. Various studies have suggested strong correlations between the 
structural features in sequences (such as regularity and nestedness) and the functional 
properties of proteins, indicating the profound link between sequence structure and 
biological function [1-3]. 

One commonly used approach to characterize the sequential features is the Shannon
entropy (defined as $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the probability of observing letter i)
and its variants [4-6]. It was originally proposed to describe the uncertainty of a random
variable, but was later adopted to characterize sequential randomness, based on the idea
that a sequence can be thought of as a realization of a sequential array of this random
variable. Shannon entropy is often applied to assess sequence divergence and sequence
polymorphism [7-11]. It represents a statistical notion of information and is insensitive 
to the internal structure and pattern of an individual sequence. Shannon entropy could 
also be pushed forward to analyze the frequency distribution of short subsequences— 
namely the k-mer method—instead of individual letters, and investigate simple and 
non-overlapping repetitions [5, 6, 12-15]. However, this type of approach overlooks the
internal hierarchical relationships in a sequence, which have been found to be very important at
the protein domain level [1, 13, 16, 17] or even in language [18].
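
The insensitivity of single-letter Shannon entropy to internal structure can be illustrated with a minimal Python sketch. The function name `shannon_entropy` and the toy strings are illustrative assumptions, not part of the original study, and a base-2 logarithm (entropy in bits) is assumed.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per letter) of the letter frequencies in seq."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    n = len(seq)
    counts = Counter(seq)
    # H = -sum_i p_i * log2(p_i), with p_i the observed frequency of letter i
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Toy strings (hypothetical): same composition, different internal structure.
periodic = "ABABABAB"   # strictly repeating pattern
irregular = "AABBBABA"  # same four A's and four B's, no simple period

print(shannon_entropy(periodic))   # 1.0
print(shannon_entropy(irregular))  # 1.0
```

Both strings receive the same entropy because only the letter frequencies enter the formula; the order in which the letters appear plays no role.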

Instead of focusing on the statistical notion, another group of methods used to 
characterize sequential features and structural information includes approaches such as 
Kolmogorov complexity and its variants. These methods aim to provide the shortest 
description for a specific target object. One particularly insightful variant is the 
“effective complexity” proposed by Gell-Mann [19,20]. It suggests that the complexity of 
a sequence can be gauged by analyzing its regularities or repetitive subsequences. It has 
been found that effective complexity is closely related to a sequence’s functional features. 
This perspective on complexity underscores the significance of sequential repetitions and 
duplications. Intriguingly, considering the abundant presence of repetitive structures in 
the human genome, one might argue that the genomes of higher eukaryotes, including 
humans, exhibit greater complexity from this structural viewpoint [21-23]. 

Nonetheless, while describing complexity is essential, offering insights into how 
this complexity evolves is another side of the coin. The amino acid sequence of a 
protein not only embodies information about its thermodynamics, folding, and other 
properties (Anfinsen’s principle) but also encapsulates details related to its evolutionary 
trajectory and history, which could be extracted. In 1977, François Jacob posited the
abstract idea that evolution is akin to “tinkering” [24], or more specifically, that innovations
arise from the opportunistic reuse or recombination of existing elements. Much of this
tinkering occurs through replication errors, for example, point mutations and DNA
duplications. The latter are associated with various replication events, such as duplicons
and transposable element expansion [22,23], leading to increased complexity in both
protein families and genomes [21,25]. Various examples that reflect this tinkering process
exist: the length of bacteriophage tails determined by TMP [26], the needle length 
of bacterial injectisome by YscP [27,28], cytochrome P450 in insects [29], antifreeze 
glycoprotein in codfish [30], the widespread presence of zinc finger proteins [31,32], and 
extensive core duplications in primates [33]. Many of these proteins have undergone 
significant expansion and mutation, either actively or passively. Yet, the challenge of 
quantifying such a tinkering process remains. 


[Figure 1: laddergraph panels (a)-(c); panel (d) x-axis: Size-index (S, i.e. length).]

Figure 1. Laddergraphs and a distribution for human proteins. (a) The laddergraph
for the protein SPR2B_MOUSE, where the string at the bottom represents this target
protein and shorter strings above are ladderons. The most basic building blocks,
namely, individual amino acids, are omitted for better visualization. Its S = 98, λ = 50
and ω = 48. (b) The laddergraph for the protein ATX8_HUMAN, with S = 80, λ = 12
and ω = 68. (c) The laddergraph for the protein A0A075B674_MOUSE, with S = 98,
λ = 94 and ω = 4. (d) The distribution of order-index ω vs. size-index S, for human
proteins with lengths below 500 AA.


The Ladderpath theory is a recently proposed framework to quantitatively describe 
the structural information of objects such as sequences, molecules, proteins, and images 
[34]. It considers the shortest path to generate the target object as the way to 
characterize it, with the key assumption that the building blocks, once generated, can 
be reused in any amount in subsequent steps. These reused building blocks are called 
ladderons, which could also be viewed as modules, as defined in ref. [34]. This aligns with
the “tinkering” process proposed by François Jacob [24]. The number of steps required
for ab initio generation of a target object indicates its generation difficulty, defined
as the ladderpath-index, λ. Hence, when considering a set of amino acid sequences
with the same length, one can discern which sequence is more straightforward or easier
to generate. Additionally, to characterize the degree of order in sequences of varying
lengths, another useful index, called the order-index, is defined as ω := S − λ, where
S is the size-index (namely, the length) of the amino acid sequence. By deconstructing
the target object into a partially ordered multiset, or equivalently the laddergraph (as
shown in Fig. 1a-c), the Ladderpath theory characterizes the structural intricacies
rooted in the hierarchical and overlapping relationships formed by the target object’s
repetitive substructures.

The concept of the Ladderpath theory aligns with several other theories, including 
Kolmogorov complexity, addition chain, assembly theory, and the “adjacent possible” 
[35-38]. While these theories have their own measures of complexity, the Ladderpath 
theory posits that “complexity” should be assessed using both the ladderpath-index 
and the order-index [34]. A sequence is not necessarily complex if it only has a high 
ladderpath-index with a low order-index (Fig. 1c), or vice versa (Fig. 1b). A sequence 
can be deemed complex if both indices are simultaneously high. Of the three real 
proteins examined, the one with both a high ladderpath-index and order-index (Fig. 1a) 
exhibits the most intricate and complex hierarchies. For a more encompassing view, Fig. 
1d shows the distribution of human proteins with lengths below 500 amino acids (AA). 
The Ladderpath theory underscores nature’s propensity to innovate through tinkering 
and reusing existing structures, a trend exemplified in processes like the evolutionary 
creation of new proteins. 

This paper is organized as follows. In Section 2.1, we provide a rigorous definition 
of the order-rate and ladderpath-complexity, and present a systematic comparison 
with a commonly used k-mer related method. Sections 2.2 and 2.3 present two 
statistical observations. The former reveals that human protein sequences exhibit 
higher ladderpath-complexity. The latter notes that proteins containing a significant 
portion of intrinsically disordered regions, on average, possess a higher order-rate. 
Both observations are statistically significant. In Section 2.4, we begin by detailing a 
statistical observation that there are almost no super long sequences with low order-rate 
values. We speculate that this might be due to the different frequencies of duplication 
and mutation across different evolutionary stages. This, in turn, suggests that the 
evolution of protein complexity follows a zigzag pattern. We offer several examples of 
protein families to support this speculation. The paper concludes with a discussion and 
a methods section that describes the algorithm for computing ladderpath-associated 
information. Open-source code is also available. 


2. Results 


2.1. Two indicators that characterize amino acid sequences 


Firstly, we have developed an efficient algorithm to compute the ladderpath-associated 
information of sequences, details of which can be found in the Methods Section, with 
codes available on GitHub for immediate use. This algorithm can effectively handle 
sequences of around or below 10,000 AA (beyond which an approximation can be made), 
in contrast to the previous algorithm (see ref. [34]) that was limited to sequences of 
approximately 20 AA. The statistics displayed in Fig. 1d were derived using this new 
algorithm.

Moving on to Fig. 1d, we noted a distinct lower boundary for the order-index ω as
the sequence length S increases. This lower boundary stems from the finite number of
basic building block types (which in this context are the 20 amino acid types), because as
the length of amino acid sequences increases, repetitive subsequences become inevitable,
resulting in a non-zero value for ω. This is purely a mathematical property, for which we
need to compensate. Hence, we introduce two new indicators, the order-rate η and the
ladderpath-complexity κ, to better characterize systems with a finite number of
basic building block types.


Order-rate η. We define the order-rate of a sequence x as

$$\eta(x) := \frac{\omega(x) - \omega_0(S)}{\omega_{\max}(S) - \omega_0(S)} \qquad (1)$$

where ω(x) is the order-index of sequence x, S is the size-index of x (namely, the
length of x), ω_max(S) is the maximum order-index among all the sequences with
length S, and ω_0(S) is the average order-index of all possible sequences with length
S, roughly corresponding to the average level of the least ordered sequences (see
Supplementary Information (SI) section 1 for the calculations of ω_0 and ω_max).

The order-rate η characterizes the hierarchical and overlapping relationships among
the subsequences of a sequence, describing the pattern regularities and repetition in the
target sequence. Values of η close to zero mean that the degree of order of the sequence
is close to the average level of random sequences, indicating that the sequence does not
exhibit any significant pattern. As η grows, the repetitive
parts become more dominant and the sequence exhibits more hierarchical structures
(see Fig. 1a). η reaches 1 only when the sequence exhibits exponential elongation of a
single letter, e.g., T → TT → TTTT → TTTTTTTT.


Ladderpath-complexity κ. Another indicator we put forward to characterize the
internal structure of sequences is the ladderpath-complexity κ, defined as:

$$\kappa(x) := \lambda(x) \cdot \eta(x) \qquad (2)$$

where λ(x) is the ladderpath-index of sequence x, and η(x) is the order-rate of x.
As mentioned, the order-rate η is a relative indicator of the regularities (compared
with the average level of totally random sequences and the most ordered sequence),
so its relevance might diminish across sequences of disparate lengths. The
ladderpath-complexity κ, instead, takes into account the minimum number of steps
required for the generation of the sequence, which is characterized by λ, thereby including
the length effect. The Ladderpath theory holds that the “complexity” of
a sequence should incorporate two aspects: the difficulty in generating
the target, and the hierarchical and interlaced relationships
within the internal sequential structure [34]. The definition of κ integrates these two
aspects, hence its name: ladderpath-complexity.

For a given length (namely, size-index S), the maximum value of the
ladderpath-complexity κ can be anticipated (see SI section 2 for the mathematical properties of
κ). That is, when ω = (S + ω_0)/2 and λ = (S − ω_0)/2, the ladderpath-complexity
κ reaches its maximum value (S − ω_0)²/[4(ω_max − ω_0)]. In the special case where
ω_0 = 0, κ reaches its maximum when ω = λ = S/2 (note that ω_0 appears in the general
case because of the baseline effect mentioned above). It indicates that when both ω
and λ are large, the ladderpath-complexity κ could be large (if only one of ω or λ is
large, κ cannot reach its maximum). This is consistent with the notion that complexity
incorporates two aspects.
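
The stated maximum can be recovered in a few lines from the definitions of the order-index, the order-rate, and the ladderpath-complexity. The sketch below assumes that S, ω_0 and ω_max are held fixed and that ω is treated as a continuous variable; the full treatment is in SI section 2, which is not reproduced here.

```latex
\begin{aligned}
\kappa &= \lambda\,\eta
        = \frac{(S-\omega)(\omega-\omega_0)}{\omega_{\max}-\omega_0}
        \qquad\text{(using } \lambda = S-\omega\text{)}, \\
\frac{\partial \kappa}{\partial \omega}
       &= \frac{S+\omega_0-2\omega}{\omega_{\max}-\omega_0} = 0
        \;\Longrightarrow\;
        \omega^{*} = \frac{S+\omega_0}{2},\qquad
        \lambda^{*} = S-\omega^{*} = \frac{S-\omega_0}{2}, \\
\kappa_{\max}
       &= \frac{(S-\omega^{*})(\omega^{*}-\omega_0)}{\omega_{\max}-\omega_0}
        = \frac{(S-\omega_0)^{2}}{4\,(\omega_{\max}-\omega_0)} .
\end{aligned}
```

Since the second derivative, −2/(ω_max − ω_0), is negative, the stationary point is indeed a maximum.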


Examples and comparative analysis. Next, we take a few protein sequences as
examples (with diverse η and κ values) to more clearly and intuitively illustrate what
η and κ characterize (Tab. 1 and Fig. 2). We can observe that: (1) PO5F1_MOUSE
has an order-rate η close to 0, meaning that the characteristic features of its internal
structure are indistinguishable from those of random sequences (from Fig. 2a we can see
its few hierarchical structures). (2) As the order-rate η increases, the sequence starts
to exhibit richer hierarchical and interlaced structures, with diverse and overlapping
ladderons (Fig. 2b), while, as η approaches 1, the hierarchy becomes more like a simple
layer-by-layer structure (Fig. 2c). (3) Although PO5F1_MOUSE and SDK2_MOUSE
have similarly small order-rates η, the latter has a much higher ladderpath-complexity
κ, simply because the latter is much longer. Meanwhile, although SRY_MOUSE is much
shorter than SDK2_MOUSE, its ladderpath-complexity κ is even slightly higher because
of its greater order-rate η (from Fig. 2b we can see its much richer hierarchical and
interlaced structures). This indicates that length affects complexity but is not the sole
determinant.
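
As a quick consistency check, the relations ω = S − λ and κ = λ · η can be applied to the values reported in Table 1. The sketch below is illustrative only: the function names are assumptions, and `order_rate` needs ω_0 and ω_max, which come from SI section 1 and are not reproduced here, so it is shown but not applied to real data.

```python
def order_index(size: int, ladderpath_index: int) -> int:
    """Order-index: omega = S - lambda."""
    return size - ladderpath_index


def order_rate(omega: float, omega_0: float, omega_max: float) -> float:
    """Order-rate, Eq. (1): eta = (omega - omega_0) / (omega_max - omega_0)."""
    if omega_max <= omega_0:
        raise ValueError("omega_max must exceed omega_0")
    return (omega - omega_0) / (omega_max - omega_0)


def ladderpath_complexity(ladderpath_index: float, eta: float) -> float:
    """Ladderpath-complexity, Eq. (2): kappa = lambda * eta."""
    return ladderpath_index * eta


# (S, lambda, omega, eta, kappa) as reported in Table 1.
TABLE_1 = {
    "PO5F1_MOUSE": (352, 279, 73, 0.0442, 12.3181),
    "SRY_MOUSE": (392, 210, 182, 0.3581, 75.1944),
    "UBC_HUMAN": (685, 73, 612, 0.8870, 64.7484),
    "SDK2_MOUSE": (2176, 1379, 797, 0.0545, 75.1940),
}

for name, (size, lam, omega, eta, kappa) in TABLE_1.items():
    assert order_index(size, lam) == omega
    # eta is reported to 4 decimals, so lambda * eta may deviate from the
    # reported kappa by up to roughly lambda * 5e-5.
    tolerance = lam * 5e-5 + 5e-5
    recomputed = ladderpath_complexity(lam, eta)
    print(name, round(recomputed, 3), kappa, abs(recomputed - kappa) <= tolerance)
```

All four rows agree within the rounding of η, which also shows why SRY_MOUSE and SDK2_MOUSE end up with nearly equal κ despite very different lengths: a large η with moderate λ and a small η with large λ give similar products.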


Table 1. Indicators characterizing protein sequences.

                               Examples of protein sequences (entry name)
Indicator                      PO5F1_MOUSE   SRY_MOUSE   UBC_HUMAN   SDK2_MOUSE
size-index (S)                         352         392         685         2176
ladderpath-index (λ)                   279         210          73         1379
order-index (ω)                         73         182         612          797
order-rate (η)                      0.0442      0.3581      0.8870       0.0545
ladderpath-complexity (κ)          12.3181     75.1944     64.7484      75.1940

Figure 2. Laddergraphs of the four example protein sequences presented in Table
1. Unlike in Fig. 1, space constraints prevent the explicit display of ladderons in this 
figure. Instead, ellipses are used to symbolize ladderons, with the size of each ellipse 
corresponding to the length of the ladderon. Subfigures (a), (b), and (c) are scaled 
identically, as evidenced by the corresponding size of the largest ellipse that represents 
the target sequence in each. In subfigure (d), due to the excessive length of the protein 
SDK2_MOUSE, only a zoomed-out version of its laddergraph is displayed. A detailed
version can be found in SI section 3. 


Now, we will compare the indicators proposed in this study with another commonly 
used method. As mentioned, a commonly used tool to describe the sequential feature is 
the Shannon entropy, which is, however, based on the statistical notion of the frequency 
and the uncertainty of single letters, rather than the internal structure of a sequence. 
Nevertheless, the k-mer method has been employed to extend the notion for single letters 
to substrings of a certain length. Chen et al. introduced a normalized indicator named
Informational Complexity (C_k) to characterize the relative uncertainty of substrings [5].
C_k is calculated based on a sliding window of a fixed length k, and thus, the internal
sequential structure has been taken into account, at least within the range of k. In
fact, C_1 is the Shannon entropy of the sequence (because a 1-mer is just a single letter),
normalized to the maximum Shannon entropy of the same length. To draw a linguistic
analogy, the Shannon entropy functions at the alphabet level, while the k-mer version C_k
constructs a dictionary comprising words of a certain length k, quantifying the Shannon
information conveyed by these fixed-length words. Consequently, the quantity (1 − C_k),
denoted as R_k, represents the degree of regularity, partially aligning with what the
order-rate η describes.
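
A rough Python sketch of a k-mer regularity in this spirit is given below. It is not the exact definition of ref. [5]: the window step of 1, the base-2 logarithm, and in particular the normalization (the maximum entropy is taken as log2 of the smaller of the number of windows and the number of possible k-mers) are assumptions made for illustration, as is the function name `kmer_regularity`.

```python
import math
from collections import Counter


def kmer_regularity(seq: str, k: int, alphabet_size: int = 20) -> float:
    """R_k = 1 - C_k, with C_k a normalized k-mer Shannon entropy (assumed form)."""
    n_windows = len(seq) - k + 1
    if k < 1 or alphabet_size < 2 or n_windows < 2:
        raise ValueError("need k >= 1, alphabet_size >= 2 and at least two windows")
    # Count every k-mer seen by a sliding window with step 1 (assumption).
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    entropy = -sum((c / n_windows) * math.log2(c / n_windows)
                   for c in counts.values())
    # Assumed normalization: the entropy cannot exceed log2 of the number of
    # windows, nor log2 of the number of possible k-mers.
    max_entropy = math.log2(min(n_windows, alphabet_size ** k))
    return 1.0 - entropy / max_entropy


# Toy strings (hypothetical): a homopolymer is maximally regular, while a
# string whose 2-mers are all distinct has no 2-mer regularity.
print(kmer_regularity("AAAAAAAAAA", 2))
print(kmer_regularity("ACDEFGHIKL", 2))
```

Under these assumptions the homopolymer gives R_2 = 1 and the all-distinct string gives R_2 ≈ 0, the two extremes between which real proteins fall.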

Then, we systematically compare the order-rate η with R_k (Fig. 3). We observe
a correlation between η and R_k, and the correlation increases as k increases to 2 and
3; for k > 3, the correlation begins to drop sharply (Fig. 3a). The correlation
exists when k = 1, 2, 3 because both indicators, η and R_k, correctly describe certain
aspects of the sequence’s regularity. Note that the order-rate η quantitatively describes
the hierarchical and interlaced relationships among the substructures of a sequence.
Therefore, it has a higher correlation with R_3 and R_2, while the correlation with R_1 is
lower. This is because 3-mers and 2-mers take substructures into account, while R_1 merely
focuses on single letters, neglecting the internal structure. Further, the correlation
decreases for k > 3 because the set of all possible k-mers expands exponentially
with k, and thus the Shannon information contained in k-mers becomes submerged in
the whole set, resulting in R_k becoming less and less informative.
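
The comparison relies on the Spearman rank correlation between η and R_k across proteins. For readers who want to reproduce this kind of comparison, a dependency-free sketch follows; the function names and the numbers in the usage example are hypothetical, and tied values receive average ranks.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-protein values, for illustration only.
eta_values = [0.04, 0.36, 0.89, 0.05]
r3_values = [0.10, 0.30, 0.80, 0.20]
print(spearman(eta_values, r3_values))
```

Because only ranks enter the calculation, the result is unaffected by the different scales of η and R_k.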


Figure 3. Systematic comparison between R_k and the order-rate η. (a) Spearman
correlation between η and R_k, as k increases, for six distinct species. (b) Scatter
plots of η vs. R_k for k = 1, 2, 3, 4, 5 and 6. Each row corresponds to a different
species. Individual dots within the plots represent individual proteins. (c) Several
representative proteins are chosen (denoted in red, green, and yellow colors) to show
how R_k changes as k increases up to 50. Note that the red curves in this subfigure
correspond to the red dots in (b), and similar associations are made for the green and
yellow curves; each row represents a species, corresponding to (b).


Another observation is that while a general correlation exists, different proteins
exhibit varying tendencies as k increases. For instance, the proteins represented by
the red points in Fig. 3b, which have large η values, tend to retain their position
along the x-axis as k increases from 2 to 6; in contrast, proteins represented by the
blue points descend rapidly along the x-axis. This suggests that these different protein
sequences have distinct internal structures. To further probe the influence of these
internal structures, we chose several representative proteins to analyze how R_k changes
as k increases up to 50. Figure 3c illustrates this, where red curves correspond to the
proteins represented by the red points in Fig. 3b, and similar associations are made
for the green and yellow curves (see SI section 4 for the ladderpath-associated
indicators of these representative proteins). We observe that: (1) The red proteins
are actually those that have large repetitive segments but lack rich hierarchical and
interlaced relationships (e.g., Ubiquitins), and thus have a relatively high η but low
κ. For them, we observe that R_k remains virtually unchanged as k increases. (2) The
green proteins have very “chaotic” sequences (i.e., almost no repetitive subsequences),
resulting in a low η. For these proteins, R_k approaches zero for k > 3. (3) The yellow
proteins fall between these two categories, exhibiting a distinct feature: their R_k decreases
slowly with k, hinting at intriguing internal structures.

To summarize, for proteins with distinct internal structures (such as the three
exemplified categories), the characterizing capability of different R_k varies. As a species
likely contains at least these three categories of proteins, it remains largely arbitrary to
determine which k should be used to characterize the sequential features of the species 
as a whole. Our approach, instead, effectively characterizes the internal structure and 
provides a global indicator without predefining a characterizing range. Intuitively, the 
Ladderpath-associated indicators liberate “confined length words” (k-mers) to “variable 
length words” (the so-called ladderons, as defined in ref. [34]), adeptly capturing the 
hierarchical and interlaced structures within sequences. 


2.2. Statistical observation: Human protein sequences have higher
ladderpath-complexity κ


Here, we present the density distribution of ladderpath-complexity κ for sequences with
lengths below 2500 AA across six typical species (Fig. 4a). The statistical differences
between distributions reflect species-specific features. We observe that the distribution
for human is the flattest, i.e., having the highest proportion of proteins with large κ. In
contrast, the distribution for E. coli appears to be more concentrated, i.e., having the
highest proportion of proteins with small κ. To put it another way (referring to Tab.
2), in terms of the density of proteins with large κ (e.g., κ = 80, 60, 40), human and
mouse rank at the top, forming the first group, followed by the second group (yeast,
mouse-ear cress, and C. elegans), and finally, E. coli. However, for proteins with small
κ (e.g., κ = 5, 10), the first group consists of E. coli, C. elegans and mouse-ear cress,
followed by yeast, and finally the third group of mouse and human. This observation
aligns with previous findings suggesting that more complex species (such as human and
mouse) tend to have longer protein lengths, more segmental repetitions, and more types
of cells [39,40], and that E. coli has fewer internal duplications [2] (and hence lower
ladderpath-complexity).

Figure 4. Overview of the ladderpath-complexity of protein sequences across six
typical species. (a) Density distribution of protein sequences with lengths below 2500
AA, with respect to ladderpath-complexity κ. (b) The average κ and the change in
κ after shuffling.


Table 2. Data from the density distribution in Fig. 4, for particular κ values.

                                   Density (%), for specific κ value
Organism                           κ=5    κ=10   κ=20   κ=40   κ=60   κ=80
H. sapiens (Human)                 3.04   3.46   2.15   0.66   0.21   0.10
M. musculus (Mouse)                3.32   3.77   2.11   0.57   0.19   0.08
A. thaliana (Mouse-ear cress)      4.24   4.34   2.19   0.33   0.08   0.02
C. elegans                         4.82   4.62   1.65   0.30   0.08   0.03
S. cerevisiae (Yeast)              3.67   3.88   2.16   0.60   0.10   0.04
E. coli                            5.42   4.43   1.82   0.13   0.02   0.00


Considering that ω_0 comes from the average ω of numerous random sequences with
homogeneous amino acid content, could the large difference primarily result from the
species-specific and inhomogeneous content rather than the internal sequence structure?
To test this speculation, we randomly shuffled all sequences (aiming to preserve the
amino acid composition but disrupt the internal structures), then recalculated their
ladderpath-complexity, and compared the changes before and after, denoted as Δκ (Fig.
4b). The results indicate that human sequences still have the most significant reduction
(the order remains the same), suggesting that human sequences possess the richest
hierarchical and interlaced structures overall, with E. coli having the least. So, the
statistical differences in ladderpath-complexity arise from the internal sequence structure
rather than the inhomogeneous content.
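
The shuffling control can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names, the number of shuffles, and the stand-in indicator `longest_run` are assumptions (the real analysis would pass a function that computes κ or η with the ladderpath algorithm).

```python
import random
from collections import Counter


def shuffle_sequence(seq: str, rng: random.Random) -> str:
    """Return a random permutation of seq; composition is preserved exactly."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def mean_change_after_shuffle(seq, indicator, rng, n_shuffles=100):
    """Indicator on the original sequence minus its mean over shuffled copies."""
    original = indicator(seq)
    shuffled = [indicator(shuffle_sequence(seq, rng)) for _ in range(n_shuffles)]
    return original - sum(shuffled) / n_shuffles


def longest_run(seq: str) -> int:
    """Stand-in indicator (hypothetical): longest run of identical letters."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


rng = random.Random(0)            # fixed seed so the sketch is reproducible
toy = "AAAAAAAABBBBBBBB"          # hypothetical, highly structured toy string
assert Counter(shuffle_sequence(toy, rng)) == Counter(toy)
print(mean_change_after_shuffle(toy, longest_run, rng))
```

A positive change means the original ordering carries structure that composition alone does not explain, which is the logic behind interpreting Δκ.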

With these results, we provide a quantitative description, at the protein level, showing that
the protein sequences of more complex species tend to possess richer hierarchical and
interlaced structures.

Showcase: Top list of large-κ proteins. Let us now examine the list of proteins
with the highest κ values, considering only those sequences with lengths below 2500 AA
(Tab. 3). Interestingly, despite this length limitation, our κ-selection results show a
similarity to the findings of the repeat finder: human proteins dominate [41]. Adjusting
this length limit to 2000, 1500, or even 1000 AA does not change this observation (see
SI section 5 for more data).

Another notable observation is the length range that spans from 1457 to 2496,
indicating that length is not the determining factor for κ; instead, repetition in the
sequence plays a significant role. For example, DMBT1, flocculin and mucin in Tab. 3
are protein classes that are famous for tandem repeats [42-44].


Table 3. Top 25 large-κ protein sequences with length limit below 2500 AA.

Protein (Entry name)  Organism                            S     λ     ω      η       κ
DMBT1_HUMAN ¹         H. sapiens (Human)               2413   779  1634  0.518  403.40
FILA2_HUMAN           H. sapiens (Human)               2391   931  1460  0.417  388.04
Q20007_CAEEL          C. elegans                       2311   889  1422  0.426  378.69
FILA2_MOUSE           M. musculus (Mouse)              2362   949  1413  0.399  378.23
DMBT1_MOUSE ¹         M. musculus (Mouse)              2085   752  1333  0.470  353.09
APOA_HUMAN            H. sapiens (Human)               2040   632  1408  0.548  346.64
HORN_MOUSE            M. musculus (Mouse)              2496   481  2015  0.714  343.44
CR1_HUMAN             H. sapiens (Human)               2039   824  1215  0.408  335.89
TRHY_HUMAN            H. sapiens (Human)               1943   811  1132  0.391  317.06
Q6DIC6_MOUSE          M. musculus (Mouse)              2087   969  1118  0.314  304.41
F186A_MOUSE           M. musculus (Mouse)              1790   716  1074  0.424  303.47
MUC22_HUMAN ²         H. sapiens (Human)               1773   700  1073  0.432  302.34
Q9LH98_ARATH          A. thaliana (Mouse-ear cress)    2081  1032  1049  0.267  275.09
A0A0B4J1F9_MOUSE      M. musculus (Mouse)              1599   665   934  0.409  272.18
PWWP4_HUMAN           H. sapiens (Human)               2061  1031  1030  0.261  269.30
FLO1_YEAST ³          S. cerevisiae (Yeast)            1537   596   941  0.451  268.92
NACAM_HUMAN           H. sapiens (Human)               2078  1049  1029  0.254  266.16
CO4A2_CAEEL           C. elegans                       1758   828   930  0.320  265.13
F7C950_MOUSE          M. musculus (Mouse)              1606   410  1196  0.642  263.12
Q9LIE8_ARATH          A. thaliana (Mouse-ear cress)    1480   477  1003  0.549  261.97
NBPFC_HUMAN           H. sapiens (Human)               1457   532   925  0.489  260.13
TARA_HUMAN            H. sapiens (Human)               2365  1249  1116  0.206  257.28
SON_HUMAN             H. sapiens (Human)               2426  1289  1137  0.199  255.93
Q63ZW6_MOUSE          M. musculus (Mouse)              1691   805   886  0.316  254.43
CO4A5_HUMAN           H. sapiens (Human)               1685   801   884  0.317  254.01

¹ belongs to the DMBT1 family. ² mucin protein. ³ belongs to the flocculin family.


2.3. Statistical observation: Proteins containing intrinsically disordered regions (IDRs)
have higher order-rate η


Now, let us consider the relationship between the amino acid sequence and its
corresponding 3D structure. Intuitively, duplicated sequences could be expected
to adopt identical structures. Therefore, a long sequence with many duplicated
subsequences (thereby tending to have higher order-rate η) may be considered to have
a consistent structure comprising explicit identical substructures [45]. For instance, the
protein depicted in Fig. 5a exhibits a consistent and regular structure [46]; another
notable example is the much larger protein DMBT1_HUMAN, shown in Fig. 5b, which
also has a high η value, as shown in Tab. 3. Nevertheless, there are proteins that have high η
values but are structurally disordered, such as the example depicted in Fig. 5c, which exhibits
regions predicted by AlphaFold2 with low confidence, implying structural disorder.


[Figure 5 panel titles: (d) From DisProt database; (e) From Metapredict software.]

Figure 5. Statistics related to the order-rate, η, of proteins containing a significant
proportion of IDRs. (a) Structure of an artificial protein DeNovoTIM15, from
the Protein Data Bank (PDB:6wvs). (b) Predicted structure of the protein
DMBT1_HUMAN by AlphaFold2. (c) Predicted structure of HORN_MOUSE by
AlphaFold2. (d) The right part with darker colors shows the average η for proteins
containing a significant proportion of IDRs compared to proteins without a significant
proportion of IDRs. The left part shows the changes in η after shuffling the sequences.
The corresponding data size, n, from the DisProt database is indicated. (e) This is
similar to subfigure (d), but the data are from calculations using the disorder predictor
software Metapredict, for the six proteomes. Note that **** means p < 0.0001, ***
means p < 0.001, ** means p < 0.01, and ns means “no significance”.


To uncover statistical patterns, we utilized data from the DisProt database [47]
to calculate the average order-rate for proteins with a significant proportion of
intrinsically disordered regions (IDRs), and compared it with other proteins without
a significant proportion of IDRs. The results are shown in Fig. 5d (the right part with
darker colors). It is evident that, generally, proteins with IDRs have higher η values
than those without such regions, which is statistically significant for four out of six
species analyzed. Yet, due to the limited data available in DisProt, we also employed
the Metapredict software [48] to predict the presence of IDRs in all proteins of the
proteomes of these species, and then matched them with their respective η values. The
outcomes of this analysis are presented in Fig. 5e. The pattern remains consistent, with
clear statistical significance observed for five out of these six species. These findings are
consistent with previous studies showing that tandem repeats, especially perfect ones, tend to
be structurally disordered [49].

Nevertheless, previous studies show that proteins containing IDRs have a greater 
amino acid abundance bias [50,51]. It is thus possible that the high η value arises 
from this bias rather than from the orderliness of the internal sequential structure. To 
investigate this, we compared the η value before and after shuffling, with the change 
denoted as Δη, as shown in Fig. 5d (the left part with lighter colors). We observed that, 
statistically speaking, Δη is larger for proteins containing IDRs. Since shuffling preserves 
the amino acid composition and destroys only the sequential arrangement, this observation 
suggests that the internal sequential structure plays a role in the high η values. Therefore, 
the degree of orderliness may serve as a new feature of disordered regions at the sequence 
level. 
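
The shuffling control described above can be sketched in a few lines. This is only an illustration of the idea, not the procedure used for Fig. 5d: computing η requires the full Ladderpath calculation, so the sketch substitutes a much cruder stand-in measure (`repeat_fraction`, the fraction of k-mers that occur more than once). The function names, the number of shuffles, the fixed seed and the sign convention (value before shuffling minus the average after shuffling) are all assumptions made for this example.

```python
import random
from collections import Counter


def repeat_fraction(seq, k=3):
    """Stand-in for an orderliness measure (NOT the order-rate itself):
    the fraction of k-mers in `seq` that occur more than once."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    return sum(1 for kmer in kmers if counts[kmer] > 1) / len(kmers)


def shuffle_delta(seq, measure=repeat_fraction, n_shuffles=20, seed=0):
    """Measure before shuffling minus the mean measure after shuffling.

    Shuffling keeps the letter composition and destroys the arrangement,
    so a large positive value points to sequential structure rather than
    to composition bias."""
    rng = random.Random(seed)
    original = measure(seq)
    shuffled_values = []
    for _ in range(n_shuffles):
        letters = list(seq)
        rng.shuffle(letters)
        shuffled_values.append(measure("".join(letters)))
    return original - sum(shuffled_values) / n_shuffles


# A toy tandem repeat: highly ordered before shuffling, much less after.
tandem = "MQIFVKTLTG" * 6
delta_tandem = shuffle_delta(tandem)

# A single-letter sequence: shuffling changes nothing, so the delta is zero.
delta_homopolymer = shuffle_delta("A" * 30)
```

In this toy setting the tandem repeat gives a large positive difference, whereas the homopolymer, whose apparent regularity comes entirely from composition, gives zero.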


2.4. Evolution of the complexity of protein sequences follows a zigzag pattern 


We now present statistics that encompass all sequences (Fig. 6), not just those shorter 
than 2500 AA. Generally, most of these sequences have η values confined below 0.1 
(Fig. 6a). However, a closer examination of the η distribution (Fig. 6c) reveals that the 
proteins of human, mouse and C. elegans exhibit a significant tendency: as the length 
of the sequence increases, there tend to be more sequences with larger η, 
indicating more ordered sequences. This trend is also observable in Fig. 6a, where, for 
extremely long lengths, sequences with low η values are absent altogether. 

An immediate question is why there are almost no super long but low-η sequences. 
Later we shall see that this question strongly relates to how protein sequences elongate. 
Now, imagine there is initially a short sequence or segment, and consider how this 
sequence elongates and how the order-rate η evolves, via specific biological processes: 


• Duplication: It refers to the process where a segment of a sequence, either short or 
long, is copied onto itself. This creates a repetitive subsequence, corresponding to a 
ladderon as defined in Ladderpath theory. As a result, η of this sequence increases. 
The longer the segment, the greater the increase in η. 


• Substitution: It refers to the replacement of a base. This does not alter the 
sequence’s length, but it may disrupt a ladderon, thereby slightly decreasing the 
value of η. 


• Insertion: It could be thought of as either the addition of a foreign segment or 
a single amino acid, or as a duplication of a segment immediately followed by 
substitutions occurring at every base. 


Note that, for simplicity, we only consider processes that do not shorten the 
sequence, thus neglecting deletion. 

Figure 6. Observations and simulation experiments related to protein elongation. (a) 
Scatter plot of protein lengths S vs. order-rate η, for all proteins across the six species. 
See the legend of subfigure (c) for the six species. (b) Similarly, a scatter plot of S vs. 
ladderpath-complexity κ. (c) The average η values of proteins vs. protein length S, for 
the six species. Each dot (S, average η) is calculated using proteins within a sliding 
window centered at a specific length S. (d) Results of simulation experiments showing 
how η evolves as the protein sequence elongates, for the three different cases elaborated 
in the main text. (e) Similarly, simulation experiments showing how κ evolves. 


We now simulate the process of elongation in three cases: (1) completely driven 
by duplication, (2) completely driven by insertion, or (3) driven by a combination of 
duplication and substitution. The simulation results are displayed in Fig. 6d (see 
SI Section 6 for details on how the simulation was conducted). Although the 
simulation focuses solely on the elongation of protein sequences, it provides insight 
into the question of why there are virtually no extremely long sequences with low η 
values. The red trajectories in Fig. 6d represent case (1), where η increases the most 
rapidly during elongation. The green trajectory represents case (2), where the order-rate 
η remains consistently low. The yellow trajectories, representing case (3), lie in between 
and closely resemble real-world scenarios where infrequent duplications of relatively large 
segments heavily increase η, while frequent substitutions consistently reduce η, forming 
a zigzag pattern. We also see from Fig. 6e that κ increases the most in case (3), namely, 
the yellow trajectories. 
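
To make the three cases concrete, the following is a minimal toy sketch of such an elongation experiment. It is not the simulation described in SI Section 6: the segment lengths, the duplication probability `p_dup`, the starting length and all function names are assumptions chosen for illustration, and the order-rate is replaced by a simple stand-in (`repeat_fraction`, the fraction of 4-mers that occur more than once), because computing η itself requires the full Ladderpath calculation.

```python
import random
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def repeat_fraction(seq, k=4):
    """Stand-in for the order-rate: fraction of k-mers occurring more than once."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    return sum(1 for kmer in kmers if counts[kmer] > 1) / len(kmers)


def random_segment(rng, length):
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def duplicate(seq, rng, max_len=40):
    """Copy a random segment and paste the copy at a random position."""
    length = rng.randint(5, min(max_len, len(seq)))
    start = rng.randrange(len(seq) - length + 1)
    pos = rng.randrange(len(seq) + 1)
    return seq[:pos] + seq[start:start + length] + seq[pos:]


def substitute(seq, rng):
    """Replace one randomly chosen position (possibly by the same letter)."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(ALPHABET) + seq[pos + 1:]


def insert(seq, rng, max_len=40):
    """Insert a random foreign segment at a random position."""
    pos = rng.randrange(len(seq) + 1)
    return seq[:pos] + random_segment(rng, rng.randint(5, max_len)) + seq[pos:]


def simulate(case, steps=200, p_dup=0.05, seed=0):
    """Return the trajectory [(length, repeat_fraction), ...] for one case."""
    if case not in ("duplication", "insertion", "mixed"):
        raise ValueError(f"unknown case: {case}")
    rng = random.Random(seed)
    seq = random_segment(rng, 50)
    trajectory = [(len(seq), repeat_fraction(seq))]
    for _ in range(steps):
        if case == "duplication":
            seq = duplicate(seq, rng)
        elif case == "insertion":
            seq = insert(seq, rng)
        elif rng.random() < p_dup:   # mixed case: rare duplication ...
            seq = duplicate(seq, rng)
        else:                        # ... and frequent substitution
            seq = substitute(seq, rng)
        trajectory.append((len(seq), repeat_fraction(seq)))
    return trajectory


dup_only = simulate("duplication", steps=30)
ins_only = simulate("insertion", steps=30)
mixed = simulate("mixed", steps=300)
```

Plotting the second component of each trajectory against the first gives curves of the same qualitative kind as those discussed above: the duplication-only run climbs quickly, the insertion-only run stays near zero, and the mixed run jumps at each duplication and decays between them.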

In summary, the evolution of protein sequences follows a zigzag pattern. 
Specifically, the duplication of segments increases the order-rate of the sequence as 
it elongates, while this increment in order-rate is gradually counteracted by various 
mutations, either partially or completely, depending on the relative frequencies of 
duplications and mutations. Now, we could consider the emergence of a new gene or 
pseudogene: (1) Occasionally, a replication error leads to the duplication of a segment 
at a different location within the sequence, resulting in higher η and κ values and 
contributing richer raw materials for further evolution. (2) Subsequently, this elongated 
sequence undergoes various “tinkering” processes across generations, reducing η and 
κ. Over time, this sequence gradually diverges from its ancestor and may eventually 
become a new gene or a pseudogene. 


Examples: Ubiquitin, Titin and NBPF family. Now we can return to the 
observation mentioned at the beginning of Section 2.4 and ask why there are almost 
no extremely long but low-η proteins. Here we provide three representative examples 
to address this question. 

The first example is Ubiquitin, which is used to emphasize the effect of duplication. 
Ubiquitin is a highly conserved, small regulatory protein widely found in eukaryotes, 
which functions as a post-translational modifier, mainly in protein degradation. 
Polyubiquitin (UBB and UBC) has an extremely high η value because it contains almost 
no mutations and has several tandem head-to-tail repeats of ubiquitin, each being 76 AA 
long [52] (see Fig. 2c for the laddergraph of UBC_HUMAN). The distribution 
of η and κ values for this protein family is shown in Fig. 6a and 6b. We can see that 
while some members of this family have an extremely high η value approaching 1, their 
corresponding ladderpath-complexity, κ, is not particularly high. This observation can 
be attributed to the nearly error-free duplication events, aligning with case (1) discussed 
earlier, and to the fact that these proteins are not particularly long. 

The second example is another extreme, the ancient protein Titin, which is used 
to emphasize the effects of mutations along protein elongation. Titin serves as a 
structural support in muscles and is of immense length (e.g., TITIN_HUMAN contains 
364 exons) [53]. This gigantic protein consists of numerous domains, some of which 
belong to the PEVK region, which is rich in highly repetitive sequences (this PEVK 
region forms a distinct structure in the center of the protein, functioning as an entropic 
spring [54]). Nevertheless, the η values of Titin are not very high and exhibit variations 
among different species [55,56], as shown in Tab. 4. This suggests that the effects of 
duplications, which can increase the hierarchical and nested structures of sequences, 
have been largely counteracted by long-term and consistent mutations. 

The third example is an emerging family at the evolutionary scale, Neuroblastoma 
BreakPoint Family (NBPF), which lies in between the two extremes mentioned above. 


Table 4. Ladderpath-associated indicators of Titin proteins. 

Protein (Entry name)   Organism                      S      λ      ω      η      κ
TITIN_DROME            D. melanogaster (Fruit fly)   18141  7843   10298  0.244  1759.78
TTN1_CAEEL             C. elegans                    18562  7545   11017  0.300  2166.39
A0A7M7N314_STRPU       S. purpuratus (Sea urchin)    24046  8692   15354  0.308  2853.28
A0A8M9QKG2_DANRE       D. rerio (Zebrafish)          31468  14202  17266  0.120  1671.51
TITIN_HUMAN            H. sapiens (Human)            34350  15001  19349  0.141  2071.50
TITIN_MOUSE            M. musculus (Mouse)           35213  15295  19918  0.144  2175.60


NBPF is known for its members having varying numbers of Olduvai repeats, with 
approximately twenty members in humans, playing a certain role in human brain 
development and cognition [57,58]. These young proteins seem to be predominantly 
found in proteomes of primates, whereas in non-primate mammals, their counterparts 
exist as single-copy Olduvai. As an amplicon, Olduvai has undergone a significant gene 
amplification within a relatively short time span [57,59]. Thus, η and κ increased 
significantly, and mutations had not had enough time to largely lower η and κ to 
counteract the effect of duplication (Tab. 5). Therefore, from Fig. 6a, we can observe 
that the NBPF family members form a clear pattern, exhibiting their evolutionary 
trajectory. 


Table 5. Ladderpath-associated indicators of the gene family NBPF. 

Protein (Entry name)   Organism             S     λ    ω     η      κ
NBPF5_HUMAN            H. sapiens (Human)   351   268  83    0.082  22.07
NBPF7_HUMAN            H. sapiens (Human)   421   321  100   0.065  20.84
NBPF3_HUMAN            H. sapiens (Human)   633   438  195   0.114  49.93
NBPF4_HUMAN            H. sapiens (Human)   638   464  174   0.066  30.82
NBPF6_HUMAN            H. sapiens (Human)   638   466  172   0.062  29.01
NBPFF_HUMAN            H. sapiens (Human)   670   445  225   0.145  64.48
NBPFB_HUMAN            H. sapiens (Human)   865   479  386   0.268  128.20
NBPF8_HUMAN            H. sapiens (Human)   869   492  377   0.251  123.33
NBPFP_HUMAN            H. sapiens (Human)   902   419  483   0.386  161.61
NBPFE_HUMAN            H. sapiens (Human)   921   420  501   0.396  166.41
NBPF9_HUMAN            H. sapiens (Human)   1111  522  589   0.362  188.89
NBPF1_HUMAN            H. sapiens (Human)   1214  523  691   0.410  214.31
NBPFC_HUMAN            H. sapiens (Human)   1457  532  925   0.489  260.13
NBPFA_HUMAN            H. sapiens (Human)   3795  584  3211  0.760  444.00
NBPFJ_HUMAN            H. sapiens (Human)   3843  491  3352  0.801  393.28
NBPFK_HUMAN            H. sapiens (Human)   5207  248  4959  0.927  229.84


The aforementioned classes of proteins illustrate the elongation seen in ancient 
proteins and the emergent core duplication found in longer proteins, which can be 
metaphorically described as an Odyssey-like journey. These examples suggest that, over 
a long duration, there is a certain degree of synchronization between size expansion and 
increased complexity; while between expansion events, complexity tends to decrease. 
Thus, the evolution of sequence complexity appears to follow a zigzag pattern. Most 
long proteins do not exhibit the same level of extremity as Ubiquitin and Titin, but 
instead fall somewhere in between, e.g., NBPF. For more examples of such proteins, 
refer to SI Section 7. 


3. Discussions 


3.1. On definitions 


The newly developed Ladderpath theory aims to decode the information concealed 
within the hierarchical and interlaced relationships among the recurring subsequences 
found in a specified set of target sequences. It achieves this by iteratively identifying 
recurring subsequences (termed the ladderons) and rearranging them into a tree- 
like hierarchical structure (termed the laddergraph), which distills and encodes the 
evolutionary information. In the context of biological sequences, these recurring 
subsequences, or ladderons, could represent motifs, domains, or signify transposable 
elements, satellite DNA, microduplications at the genome scale, and the like. To better 
encapsulate the tree-like hierarchical structure, two indices were derived. The first is the 
order-rate η, which, in a normalized manner, quantitatively measures the orderliness of 
a sequence, ranging from close to 0 (completely disordered, as illustrated in Fig. 2a) to 
1 (fully ordered, as illustrated in Fig. 2c). When η sits centrally, the structure exhibits 
significant order while the ladderons display intricate overlaps and nested relationships 
(as illustrated in Fig. 2b). At this point the other derived index, κ, reaches its maximum, 
signifying the utmost complexity. The ladderpath-complexity κ gauges complexity by 
factoring in both orderliness and the length. While sequence length does contribute to 
complexity, longer does not necessarily equate to more complex. 

Ladderpath differs from Shannon entropy in that the latter primarily focuses on the 
statistics of individual letters, although extensions, such as the k-mer method [5,6], can 
be adapted to consider substructures. These approaches did not factor in the intricate 
hierarchical relationships among these substructures. Thus, our order-rate η shows 
a correlation with Rg, an index derived from the k-mer method, but this correlation 
varies with different internal sequential patterns. Further, Shannon entropy and its 
variants operate under a strong assumption that the sequence in question represents a 
realization of a random variable, implying that the sequence should be infinitely long. 
However, in reality, amino acid sequences invariably have finite lengths. On the other 
hand, finite lengths mean that methods like the Lempel-Ziv lossless compression (those 
aiming to describe a form of absolute information) cannot achieve their optimal or 
shortest description [60]. This introduces significant variability when trying to deduce 
genuine evolutionary histories. In contrast, Ladderpath does not rely on assumptions 
of infinite length. 


3.2. Statistical observations on sequential orderliness and complexity 


The first statistical observation, based on our examination of the ladderpath-complexity 
of proteomes, reveals differing complexity distributions among species. Among the six 
species analyzed, humans, followed by mice, possess relatively more sequences of high 
complexity that exhibit richer hierarchical and interlaced structures. We also confirmed 
using shuffling methods that this complexity does not stem from content differences 
(e.g., the so-called C-value paradox or enigma [61]) but arises from internal sequential 
patterns. From the perspective of protein structure, studies have shown that species 
with higher complexity possess more proteins with larger radii of gyration (signifying 
increased flexibility) and a higher degree of modularity [62]. On the other hand, our 
analysis from the sequential perspective implies that the more complex a species is, 
the higher the tendency for sequence complexity. (It is worth noting that although the 
definition of species complexity remains debated, in practice, biologists often employ 
varied metrics like the total cell types, genome size, or proteome size to gauge species 
complexity, whereas these metrics are often interrelated [62,63].) Collectively, these 
results hint at positive correlations between amino acid sequence complexity, protein 
structural modularity, and overall species complexity. 

Another statistical observation is that proteins with a significant proportion of 
intrinsically disordered regions (IDRs) tend to exhibit higher order-rate, η, with 
statistical significance. It is crucial to note that these elevated η values usually do 
not exceed 0.1 (as shown in Fig. 5e). Within this range, a higher η invariably 
indicates richer hierarchical and interlaced structures, akin to transitioning from proteins 
exemplified in Fig. 2a to those in Fig. 2b (a transition to Fig. 2c is not possible, since such 
an extreme hierarchical relationship requires an η of approximately 0.8 or higher). Thus, our 
findings suggest that, at the sequence level, proteins with IDRs tend to have richer 
hierarchical and interlaced structures compared to typical proteins. This correlation 
between sequence orderliness and structural uncertainty is somewhat unexpected but 
intriguing, meriting further investigation. On the other hand, understanding that a 
higher η often originates from segment duplication, another intriguing question arises: 
Does the evolution of intrinsically disordered proteins (IDPs) involve more duplication 
events? Lastly, building upon the earlier point that more complex species have more 
proteins with higher modularity, a bold idea might be developed: Could IDPs be 
an essential stage in the evolutionary journey towards increasing protein modularity? 
Specifically, an amino acid sequence “core”, through occasional duplication events, 
generates repetitive subsequences along elongation (which naturally leads to an increase 
in η), resulting in structures becoming more disordered and flexible, facilitating the 
exploration of various interactions, and ultimately leading to the fixation of structural 
modules. 


3.3. On evolution 


Our results suggest that as the protein elongates, its complexity follows a zigzag pattern, 
originating from the interplay of duplication and mutation (the latter refers to processes 
such as substitution and insertion). Duplication results in a sharp increase in sequence 
orderliness and length, while mutation leads to a decline in orderliness, with the length 
remaining more or less unchanged, together leading to a significant diversity in the 
internal patterns of sequences. Owing to the interplay of these mechanisms and their 
varying occurrence rates, the internal structure of the sequence can become highly 
hierarchical and interlaced. This might result in proteins having distinct values of 
κ, η and S (e.g., leading to different distributions between long and short proteins), 
potentially promoting a range of structures and functions. Statistically speaking, we 
did observe that η distributions diverge when protein length exceeds 2000 AA (Fig. 6c). 
This hints that various species, or those in varied environments, might adopt different 
elongation strategies or, in other words, different “tinkering” processes. For instance, 
the trend of evolving into multi-domains is more pronounced in eukaryotic proteins than 
in prokaryotes [64], suggesting that distinct biological elongation dynamics might be at 
play. The evolution of human-specific segmental duplications (HSDs) seems to exhibit 
varied patterns across different periods. During the human-chimpanzee divergence, there 
was a period of relative quiescence, succeeded by a spike in HSD occurrences and the 
emergence of new genes [65-67]. The previously mentioned NBPF experienced rapid, 
widespread duplications. The Olduvai domain, in particular, stands out as one of the 
most extreme and fastest copy number expansions in the human genome (with humans 
having about 300 copies, great apes 90-120, monkeys 30-40, and single or a few copies in 
non-primate mammals, while being absent in non-mammals), which has been strongly 
linked to human brain evolution and cognitive function [68]. Variations in elongation 
mechanisms, especially under diverse or rapidly changing environmental stresses, might 
be advantageous for quick adaptation [69], potentially accelerating the emergence of 
new structures or inducing dose-dependent effects [70,71], among other outcomes. 

A closer look at Fig. 6c reveals further patterns. The curve for the E. coli 
proteome breaks off at around 2000 AA. Beyond this length, 
yeast and mouse-ear cress (as species with cell walls) show no increase in order-rate. 
Meanwhile, for the mouse and C. elegans, both multicellular species without cell walls, 
there is an evident rise in order-rate, with their trends aligning closely. At lengths 
greater than 3000 AA, the order-rate of human proteome experiences a sudden and 
significant surge. Based on these observations, we make the following speculations: (1) 
The juncture at which the E. coli length halts could be a pivotal point in the shift 
from prokaryotes to eukaryotes. Eukaryotes might have developed additional tools for 
sequence expansion and tools that augment the hierarchical and interlaced structure of 
sequences, especially those facilitating intragenomic duplication. This transition could 
be a landmark event differentiating eukaryotes from prokaryotes. (2) The patterns 
observed in yeast and mouse-ear cress indicate that being unicellular or multicellular 
may not be a key factor affecting proteome orderliness, while having a cell wall might 
pose an obstruction to increasing the order-rate. We speculate that the cell wall might 
hinder horizontal gene transfer between species, preventing elements with capabilities 
such as translocation and duplication from integrating and emerging as evolutionary 
tools. (3) The spike in order-rate seen in humans, relative to other eukaryotic species, 
raises the question: have humans undergone certain critical events or acquired novel 
genetic tools? If a stark contrast remains when comparing humans at this point with 
other non-human primates (taking the example of NBPF, as previously discussed), 
it might explain the profound impact of social development within human evolution. 
Collectively, these findings suggest a deeper exploration of evolutionary data using this 
approach or similar methodologies. It also underscores that the Ladderpath theory could 
harbor significant potential for more in-depth applications in evolutionary biology. 

On the other hand, the simulated evolutionary process obtained through alternating 
segmental duplication and mutation provides a better fit to actual evolutionary data 
than considering mutations alone. This phenomenon poses a significant challenge to the 
neutral theory [72] and constructive neutral evolution [73]. At the very least, it suggests 
that from the time of Darwin to current evolutionary biology theories [74], there has 
been an overemphasis on the role of mutations, neglecting the effects of gene duplication 
and transfer. As inferred from the sudden shifts shown above, gene duplication 
and transfer are likely the main ingredients for significant evolutionary leaps. This 
resonates with the endosymbiotic theory [75] and horizontal gene transfer [76] applied 
to explain genome expansion. Such observations hint at two important applications 
of the Ladderpath theory: (1) Identifying critical shifts and branching points in the 
entire evolutionary tree, examining whether new gene modules have been added, and 
pinpointing which of these modules have undergone extensive duplication and transfer 
in subsequent evolutionary bursts. The Ladderpath theory may address the problem of 
phylogenetic lineages that are obscured by chimeric, symbiotic, or reticulate evolutionary 
events, which may provide crucial insights into phenomena like the Cambrian explosion 
[77]. (2) In fields such as synthetic biology and enzyme engineering [78], as well as 
pharmaceutical engineering [79], the practice of directed evolution is mainly based 
on point mutations and mutation libraries. There is limited application of strategies 
involving extensive gene segmental duplication. Introducing the Ladderpath theory 
into these fields, and adopting the “alternating segmental duplication and mutation” 
strategy simulated in this study, may significantly enhance the rate and success of 
directed evolution. Furthermore, while gene duplication has found many applications in 
plant and animal breeding, issues like the adaptability of inserted duplicate fragments 
and their loss in subsequent generations have consistently hampered successful breeding 
rates [80]. Using the Ladderpath theory to determine the optimal ratio and strategy 
for duplication and mutation might offer improved tools for targeted breeding [81] and 
related biotechnological endeavors. 

The Ladderpath theory provides a theoretical framework and specific computational 
methods to quantitatively describe the complexity of target objects, such as sequences. 
It focuses on “how objects are generated” rather than on emphasizing uncertainty, 
as in the case of Shannon entropy, or the efficiency of compression, as seen in 
lossless compression algorithms like Lempel-Ziv. The Ladderpath theory embodies the 
evolutionary tinkering process, highlighting the importance of “reuse” and “modularity”. 
While this paper demonstrates the usefulness of derived indicators such as order-rate and 
ladderpath-complexity, it is even more crucial to note that comprehensive information 
is stored in the laddergraph, which depicts the hierarchical and interlaced relationships 
among recurring subsequences, resulting from the evolutionary tinkering process. In 
practice, we can learn from the tinkering mechanisms of innovation that nature employs 
(along with sophisticated and powerful reductionist-like innovation) to help us construct 
complex targets or systems from simpler ones, e.g., peptide drug design (to be discussed 
in an upcoming paper) and synthetic biology. Using the Ladderpath theory as a tool 
to reverse-engineer species evolution might also offer valuable insights, facilitating the 
design of more effective directed evolution strategies, which could then be applied to 
fields such as crop breeding and even the design of bioprocesses. 


4. Methods 


4.1. Algorithm for computing ladderpath-associated information 


Here we show how the algorithm works by taking a target sequence 
CUCGACGACUAUCUCGACAAUGACU as an example (Fig. 7a). Firstly, we search for the longest 
repetitive subsequence in the target sequence and find CUCGAC, marked in blue. Secondly, 
we cut the target sequence into pieces so that the repetitive subsequences are isolated. 
As a result, we obtain a set of shorter sequences: [CUCGAC, GACUAU, CUCGAC, 
AAUGACU]. In the third step, we place one CUCGAC into a separate bag, which will 
then be used to construct the ladderpath. After this step, we have a set of sequences 
[GACUAU, CUCGAC, AAUGACU] remaining. These three steps constitute the module 
which we call “SEARCH, CUT, and REMOVE”, marked in green in Fig. 7a. 

Next, we treat the remaining set of sequences [GACUAU, CUCGAC, AAUGACU] 
as a “target” sequence and apply the “SEARCH, CUT, and REMOVE” module to this 
target. From this, we obtain another longest repetitive subsequence, GACU, which 
we place into the separate bag. We continue to apply the module until the original 
target sequence is completely segmented into its most basic building blocks. Finally, 
based on the order of removal, we construct the ladderpath as { G, A(3), C(3), U(3) 
// AU, GAC / CUCGAC, GACU }. It characterizes the hierarchical and interlaced 
relationships within the original target sequence, and has a one-to-one correspondence 
with a laddergraph shown in Fig. 7b. 
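
The “SEARCH, CUT, and REMOVE” loop can be sketched as follows. This is a simplified greedy illustration written for this example, not the code in the LadderpathCalculator repository: it uses brute-force search, breaks ties by taking the first candidate found, and does not build the laddergraph or compute any index. All function names are hypothetical.

```python
from collections import Counter


def longest_repeat(pieces, min_len=2):
    """SEARCH: return the longest substring that occurs at least twice
    (non-overlapping) across all pieces, or None if there is none."""
    max_len = max((len(piece) for piece in pieces), default=0)
    for length in range(max_len, min_len - 1, -1):
        for piece in pieces:
            for i in range(len(piece) - length + 1):
                candidate = piece[i:i + length]
                # str.count counts non-overlapping occurrences.
                if sum(p.count(candidate) for p in pieces) >= 2:
                    return candidate
    return None


def cut(pieces, repeat):
    """CUT: split every piece so that each occurrence of `repeat` is isolated."""
    result = []
    for piece in pieces:
        for j, part in enumerate(piece.split(repeat)):
            if j > 0:
                result.append(repeat)  # one occurrence sat before this part
            if part:
                result.append(part)
    return result


def greedy_ladderons(target):
    """Apply SEARCH, CUT, and REMOVE until no repeat of length >= 2 is left.

    Returns the repeats in order of removal and the letter counts of the
    pieces that remain at the end."""
    pieces = [target]
    bag = []
    while True:
        repeat = longest_repeat(pieces)
        if repeat is None:
            break
        pieces = cut(pieces, repeat)
        pieces.remove(repeat)  # REMOVE: move one copy into the bag
        bag.append(repeat)
    return bag, Counter("".join(pieces))


bag, letters = greedy_ladderons("CUCGACGACUAUCUCGACAAUGACU")
print(bag)            # ['CUCGAC', 'GACU', 'GAC', 'AU']
print(dict(letters))  # letter counts: G 1, A 3, C 3, U 3
```

For the example target, the repeats are removed in the order CUCGAC, GACU, GAC, AU, and the letters left over at the end are one G and three each of A, C and U, which coincides with the ladderpath written above. Agreement on this one example does not imply that the sketch reproduces the published algorithm in general.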


Figure 7. The algorithm for computing ladderpath-associated information. (a) 
Flowchart illustrating the algorithm with a specific example. (b) The laddergraph 
of the exemplified target sequence, calculated by this algorithm. 


The source code for this algorithm is available on GitHub: 
https://github.com/yuernestliu/LadderpathCalculator. Note that for sequences below 
2500 AA, the code can handle everything efficiently. For sequences between 2500 and 
10,000 AA, the code is efficient in all aspects except for determining the order-rate 
η, as computing the accurate value of ω0(S) for S > 2500 AA requires significant 
computational power. If ω0(S) for S > 2500 can be estimated in some way, the η value 
can be determined. Thus, in Fig. 6, we used interpolation to estimate ω0 between 2500 
and 10,000 AA. For sequences exceeding 10,000 AA, the ladderpath calculation becomes 
too time-consuming, leading us to use a rough approximation. Further details can be 
found in SI section 8. 


4.2. Data availability 


All generated data and associated codes are available on GitHub: 
https://github.com/yuernestliu/LadderpathCalculator. All protein sequence data were 
obtained from the UniProt database. The six proteomes (release 202304) used 
in this paper were indexed by the following Proteome IDs and Taxonomy IDs: 
UP000000589_10090, UP000000625_83333, UP000001940_6239, UP000002311_559292, 
UP000005640_9606, UP000006548_3702. The data associated with the IDR analysis 
were obtained from the DisProt database version 9.4 (release 2023_06). 


4.3. Identifying IDRs 


For Fig. 5d, if for a protein sequence, the ratio of the consensus region to the total 
length is over 25%, we say that this protein contains a significant proportion of IDRs. 
For Fig. 5e, we applied the disorder predictor software Metapredict to the proteomes 
of the six species H. sapiens (Human), M. musculus (Mouse), A. thaliana (Mouse-ear 
cress), C. elegans, S. cerevisiae (Yeast), and E. coli. For each protein sequence, we 
used the command-line tools of Metapredict and obtained the disorder score for each 
amino acid. Those amino acids were labelled as “disordered” if the score was over 0.5 
(the default value from the software). Then, if the ratio of disordered amino acids is 
over 25%, we say that this protein contains a significant proportion of IDRs. 
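
The thresholding rule above amounts to two comparisons, sketched below. The function names are hypothetical, and the per-residue scores are passed in as a plain list so that the example does not depend on any particular predictor interface; in practice they would come from the disorder predictor.

```python
def disordered_fraction(scores, threshold=0.5):
    """Fraction of residues whose disorder score is over `threshold`."""
    if not scores:
        return 0.0
    return sum(1 for score in scores if score > threshold) / len(scores)


def has_significant_idr(scores, threshold=0.5, min_fraction=0.25):
    """True if the ratio of disordered residues is over `min_fraction`."""
    return disordered_fraction(scores, threshold) > min_fraction


# Toy per-residue scores (made up for illustration, not predictor output).
mostly_ordered = [0.9, 0.1, 0.1, 0.1]   # exactly 25% disordered: not "over 25%"
half_disordered = [0.9, 0.8, 0.1, 0.2]  # 50% disordered
```

Both comparisons are strict, following the wording “over 0.5” and “over 25%”, so a protein with exactly a quarter of its residues labelled as disordered is not counted.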